Chapter 7 Chinese Text Processing

In this chapter, we will discuss one of the most important issues in Chinese language and text processing: word segmentation. When we discussed tokenization in Chapter @ref(tokenization), word tokenization in English was relatively straightforward because English word boundaries are delimited by whitespace. Chinese, however, has no whitespace between characters, which makes word tokenization a serious challenge.

This chapter is devoted to Chinese text processing. We will look at the issues of word tokenization and introduce jiebaR, the most widely used library for Chinese word segmentation. We will also include several case studies on Chinese text processing.

7.1 Chinese Word Segmenter jiebaR

First, if you have not installed the library jiebaR, please install it:
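A minimal install from CRAN:

```r
# Install jiebaR from CRAN; its dictionary data package jiebaRD
# is installed automatically as a dependency
install.packages("jiebaR")
```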

Now let us take a look at a quick example.
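A minimal sketch of such an example. The input text below is reconstructed from the tagged output shown later in this chapter, and the variable names (text, seg1) are our own:

```r
library(jiebaR)

text <- "綠黨桃園市議員王浩宇爆料,指民眾黨不分區被提名人蔡壁如、黃瀞瑩,在昨(6)日才請辭是為領年終獎金。台灣民眾黨主席、台北市長柯文哲7日受訪時則說,都是按流程走,不要把人家想得這麼壞。"

seg1 <- worker()       # initialize a segmenter with default settings
segment(text, seg1)    # segment the text with seg1
```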

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"

To segment a text, you first initialize a segmenter (here, seg1) with worker(), and then use this segmenter to segment() texts.

There are many parameters you can specify when you initialize the segmenter with worker(). You can find more details in the documentation (?worker). Some of the important arguments include:

  • user = ...: This argument is to specify the path to a user-defined dictionary
  • stop_word = ...: This argument is to specify the path to a stopword list
  • symbol = FALSE: Whether to return symbols (default is FALSE)
  • bylines = FALSE: Whether to return the words of each input line as a separate list element

From the above example, it is clear that some words are not correctly identified by the current segmenter: 民眾黨, 不分區, 黃瀞瑩, 柯文哲. It is therefore recommended to include a user-defined dictionary when doing word segmentation, because different corpora may have their own unique vocabulary.
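A sketch of re-initializing the segmenter with a user-defined dictionary; the file name user_dict.txt is hypothetical, and text is assumed to hold the news excerpt segmented above:

```r
# user_dict.txt is a plain-text file with one word per line, e.g.:
# 民眾黨
# 不分區
# 黃瀞瑩
# 柯文哲
seg2 <- worker(user = "user_dict.txt")
segment(text, seg2)
```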

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指"      
##  [7] "民眾黨"   "不分區"   "被"       "提名"     "人"       "蔡壁如"  
## [13] "黃瀞瑩"   "在昨"     "6"        "日"       "才"       "請辭"    
## [19] "是"       "為領"     "年終獎金" "台灣"     "民眾黨"   "主席"    
## [25] "台北"     "市長"     "柯文哲"   "7"        "日"       "受訪"    
## [31] "時則"     "說"       "都"       "是"       "按"       "流程"    
## [37] "走"       "不要"     "把"       "人家"     "想得"     "這麼"    
## [43] "壞"

The format of the user-defined dictionary is one word per line. Also, the default encoding of the dictionary is UTF-8. Please note that in Windows, the default encoding of a txt file created by Notepad may not be UTF-8.
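One way to sidestep the Notepad encoding issue is to create the dictionary file directly from R, where you can control the encoding explicitly. The file name and word list below are illustrative:

```r
# Write a user dictionary in UTF-8, one word per line
con <- file("user_dict.txt", open = "w", encoding = "UTF-8")
writeLines(c("民眾黨", "不分區", "黃瀞瑩", "柯文哲"), con)
close(con)
```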

Creating a user-defined dictionary from scratch may take a lot of time. You may consult 搜狗詞庫 (the Sogou lexicon collection), which includes many domain-specific dictionaries created by others. Note, however, that these dictionaries are distributed in the .scel format; you may need to convert the .scel files to .txt before using them in jiebaR. To do the conversion automatically, please consult the library cidian.

When you initialize the segmenter, you can also specify a stopword list, i.e., words you do not want to include in later analyses.
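A sketch, assuming a hypothetical stopword file stop_words.txt with one word per line (e.g., 日, 是, 都) and the news excerpt in text:

```r
seg3 <- worker(stop_word = "stop_words.txt")
segment(text, seg3)   # words in the stopword list are dropped from the output
```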

##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "才"       "請辭"     "為領"     "年終獎金" "台灣民眾" "黨"      
## [25] "主席"     "台北"     "市長"     "柯文"     "哲"       "7"       
## [31] "受訪"     "時則"     "說"       "按"       "流程"     "走"      
## [37] "不要"     "把"       "人家"     "想得"     "這麼"     "壞"

So far we have not seen any part-of-speech tags from the word segmenter. If you need the POS tags of the words, you must specify this when you initialize worker().
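In jiebaR, this is done by setting type = "tag" in worker(). A sketch, reusing the hypothetical user dictionary and text from above:

```r
seg4 <- worker(type = "tag", user = "user_dict.txt")
segment(text, seg4)   # returns a vector of words with POS tags as names
```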

##          n         ns          n          x          n          n          x 
##     "綠黨"   "桃園市"     "議員"   "王浩宇"     "爆料"       "指"   "民眾黨" 
##          x          p          v          n          x          x          x 
##   "不分區"       "被"     "提名"       "人"   "蔡壁如"   "黃瀞瑩"     "在昨" 
##          x          d          v          x          n          x          x 
##        "6"       "才"     "請辭"     "為領" "年終獎金"     "台灣"   "民眾黨" 
##          n         ns          n          x          x          v          x 
##     "主席"     "台北"     "市長"   "柯文哲"        "7"     "受訪"     "時則" 
##         zg          p          n          v         df          p          n 
##       "說"       "按"     "流程"       "走"     "不要"       "把"     "人家" 
##          x          r          a 
##     "想得"     "這麼"       "壞"

The following table lists the annotations of the POS tagsets used in jiebaR:

You can check the dictionaries being used in your current environment:
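jiebaR provides show_dictpath() for locating the default dictionary directory shipped with jiebaRD, and dir() then lists the files in it:

```r
library(jiebaR)
show_dictpath()        # path to the default jiebaRD dictionary directory
dir(show_dictpath())   # dictionary files in that directory
```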

## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/jiebaRD/dict"
##  [1] "backup.rda"      "hmm_model.utf8"  "hmm_model.zip"   "idf.utf8"       
##  [5] "idf.zip"         "jieba.dict.utf8" "jieba.dict.zip"  "model.rda"      
##  [9] "README.md"       "stop_words.utf8" "user.dict.utf8"
##  [1] "\""  "."   "。"  ","   "、"  "!"  "?"  ":"  ";"  "`"   "﹑"  "•"  
## [13] """  "^"   "…"   "‘"   "’"   "“"   "”"   "〝"  "〞"  "~"   "\\"  "∕"  
## [25] "|"   "¦"   "‖"   "— " "("   ")"   "〈"  "〉"  "﹞"  "﹝"  "「"  "」" 
## [37] "‹"   "›"   "〖"  "〗"  "】"  "【"  "»"   "«"   "』"  "『"  "〕"  "〔" 
## [49] "》"  "《"

When we use segment() as the tokenization function in unnest_tokens(), it is very important to specify bylines = TRUE in worker(). This setting ensures that segment() takes a vector of texts as input and returns a list of word vectors as output, one element per input text.

NB: When bylines = FALSE, segment() returns a vector.
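A sketch of the contrast, using two made-up sentences:

```r
library(jiebaR)

texts <- c("我是在測試一個句子。", "這是另一個句子。")

seg_line <- worker(bylines = TRUE)
out <- segment(texts, seg_line)
class(out)        # a list: one word vector per input text

seg_flat <- worker(bylines = FALSE)
class(segment(texts, seg_flat))   # a single flat character vector of words
```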

## [[1]]
##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"
##  [1] "綠黨"     "桃園市"   "議員"     "王浩宇"   "爆料"     "指民眾"  
##  [7] "黨"       "不"       "分區"     "被"       "提名"     "人"      
## [13] "蔡壁如"   "黃"       "瀞"       "瑩"       "在昨"     "6"       
## [19] "日"       "才"       "請辭"     "是"       "為領"     "年終獎金"
## [25] "台灣民眾" "黨"       "主席"     "台北"     "市長"     "柯文"    
## [31] "哲"       "7"        "日"       "受訪"     "時則"     "說"      
## [37] "都"       "是"       "按"       "流程"     "走"       "不要"    
## [43] "把"       "人家"     "想得"     "這麼"     "壞"
## [1] "list"
## [1] "character"

7.2 Case Study 1: Word Frequency and Wordcloud

7.3 Case Study 2: Patterns

## [1] "綠黨_n 桃園市_x 議員_n 王浩宇_x 爆料_n ,_x 指民眾_x 黨_n 不_d 分區_n 被_p 提名_v 人_n 蔡壁如_x 、_x 黃_zg 瀞_x 瑩_zg ,_x 在昨_x (_x 6_x )_x 日_m 才_d 請辭_v 是_v 為領_x 年終獎金_n 。_x 台灣_x 民眾_x 黨_n 主席_n 、_x 台北_x 市長_x 柯文_nz 哲_n 7_x 日_m 受訪_v 時則_x 說_zg ,_x 都_d 是_v 按_p 流程_n 走_v ,_x 不要_df 把_p 人家_n 想得_x 這麼_x 壞_a 。_x"
## [1] "我_r 是_v 在_p 測試_vn 一個_x 句子_n"

For more information on the Unicode ranges for punctuation in CJK languages, please see this SO discussion thread.
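As a toy illustration of splitting a text into inter-punctuation units, a character class of common fullwidth punctuation marks can be used with strsplit(); the class below is a simplified subset of the full CJK punctuation ranges (U+3000–U+303F and U+FF00–U+FFEF):

```r
sent <- "都是按流程走,不要把人家想得這麼壞。"
# split on a (simplified) set of fullwidth CJK punctuation marks
strsplit(sent, "[,。!?;:、]")[[1]]
```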

After we segment our corpus into inter-punctuation units (IPUs), we can make use of the words as well as their part-of-speech tags to extract the target pattern we are interested in: 被 + ... constructions.
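As a toy sketch of the idea (not the chapter's actual code), one can match the tagged token 被_p followed by a limited number of word_tag tokens using a regular expression; the tagged string below is made up for illustration:

```r
ipu <- "指_v 民眾黨_n 不分區_n 被_p 提名_v 人_n"
# match 被_p plus up to two following word_tag tokens
m <- regmatches(ipu, regexpr("被_p(\\s\\S+_[a-z]+){1,2}", ipu, perl = TRUE))
m
```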


Exercise 7.1 Please use the apple_ipu as your corpus and extract Chinese particle constructions of ... 外/內/中. Usually a particle construction like this consists of a landmark NP and the space particle. For example, in 任期內, 任期 is the landmark NP and 內 is the space particle. In this exercise, we naively assume that the word directly preceding the space particle is the head noun of our landmark NP. Please extract all concordance lines with these space particles and at the same time identify their SPC and LM, as shown below.

Exercise 7.2 Following Exercise 7.1, please generate a frequency list of the LMs for each space particle. Show the top 10 LMs of each space particle and arrange the frequencies of the LMs in descending order, as shown below.

Exercise 7.3 Following Exercise 7.2, for each space particle, please create a word cloud of its co-occurring LMs based on the top 200 LMs of each particle.

PS: The word frequencies in the word clouds shown below are on a log scale.


7.4 Case Study 3: Lexical Bundles

With word boundaries in place, we can also analyze recurrent multiword units in Chinese news. Here let's take a look at recurrent four-grams. As we discussed in Chapter @ref(tokenization), a multiword unit can be defined based on at least two metrics:

  • the frequency of the whole multiword unit (i.e., frequency)
  • the number of texts where the multiword unit is observed (i.e., dispersion)

As the default tokenization in unnest_tokens() only works for English data, we start this task by defining our own function, ngram_chi(), for Chinese n-grams.

ngram_chi() takes a single text (a character scalar) as input and returns a vector of n-grams. Most importantly, the function assumes that the word tokens in the text string are delimited by whitespace.
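A minimal sketch of such a function; the parameter names and defaults (n for the n-gram size, sep for the joining separator) are our assumptions:

```r
ngram_chi <- function(text, n = 2, sep = "_") {
  # split the whitespace-delimited text into word tokens
  words <- unlist(strsplit(text, "\\s+"))
  if (length(words) < n) return(character(0))
  # slide a window of size n over the tokens, joining each window with sep
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = sep),
         character(1))
}

ngram_chi("這 是 一個 測試 的 句子 。", n = 2)
```

Varying n and sep (e.g., n = 4, or n = 5 with sep = " ") yields the other outputs shown below.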

## [1] "這_是"     "是_一個"   "一個_測試" "測試_的"   "的_句子"   "句子_。"
## [1] "這_是_一個_測試"   "是_一個_測試_的"   "一個_測試_的_句子"
## [4] "測試_的_句子_。"
## [1] "這 是 一個 測試 的"   "是 一個 測試 的 句子" "一個 測試 的 句子 。"

We vectorize the function ngram_chi().


Vectorized functions are a very useful feature of R, but programmers who are used to other languages often have trouble with this concept at first. A vectorized function works not just on a single value, but on a whole vector of values at the same time.

Our first version of ngram_chi() takes a single text as input and processes one text at a time. However, we would like ngram_chi() to process a vector of texts (i.e., multiple texts) at once and return a list of resulting n-gram vectors, one per text. Therefore, we use Vectorize() as a wrapper to vectorize our function, specifically telling R that the argument text is vectorized, i.e., that each value in the text vector should be processed in the same way.
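A sketch of this vectorization step; a condensed copy of our ngram_chi() sketch is included so the snippet is self-contained:

```r
ngram_chi <- function(text, n = 2, sep = "_") {
  words <- unlist(strsplit(text, "\\s+"))
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = sep),
         character(1))
}

# vectorize over the `text` argument; SIMPLIFY = FALSE keeps the result
# as a list with one n-gram vector per input text
vngram_chi <- Vectorize(ngram_chi, vectorize.args = "text", SIMPLIFY = FALSE)

vngram_chi(c("我 是 在 測試", "一個 句子"), n = 2)
```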


Now we can tokenize our corpus into n-grams using our own tokenization function vngram_chi() together with unnest_tokens(). In this case study, we demonstrate an analysis of four-grams in our Apple News corpus.

We begin by creating a sentence ID for each IPU of each article. Then we remove all POS tags, and these cleaned IPUs are passed to vngram_chi() to extract four-grams.

Now that we have all the four-grams in each article, we can create a frequency list of four-grams. In particular, we compute the frequency of each four-gram as well as its dispersion.
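A sketch of how these two metrics can be computed with dplyr, assuming a data frame apple_ngrams (a hypothetical name) with one four-gram per row in the column ngram and an article identifier in doc_id:

```r
library(dplyr)

ngram_freq <- apple_ngrams %>%
  group_by(ngram) %>%
  summarize(
    freq       = n(),                # frequency: total number of occurrences
    dispersion = n_distinct(doc_id)  # dispersion: number of texts it occurs in
  ) %>%
  ungroup() %>%
  arrange(desc(freq), desc(dispersion))
```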

Please take a look at the four-grams, arranged by frequency and by dispersion: